1 Introduction

The Zipf distribution [12] appears very often in practice when modelling natural as well
as man-made phenomena. This is because of its simplicity and its suitability to capture
the main sample characteristics. Among these characteristics, the following are worth
highlighting: a) a large probability at one in most of its parameter space, b) a long right
tail, c) linearity in the log-log scale and d) scale-freeness. Nevertheless, in many cases the
proportion of the first few positive integer values observed differs considerably from the
expected probabilities under the Zipf distribution. This is a consequence of the fact that,
in those situations, the linear behaviour only holds for large integer values.



The Zipf distribution is a particular case of the discrete Power Law (PL) distribution. 
In [3], more than twenty situations from different research areas are presented in which
the PL distribution has been considered as a candidate distribution. The areas include
physics, biology, information sciences, social networking, engineering and social sciences.
For example, in the real world one observes that a few mega cities contain a population
that is orders of magnitude larger than the mean population of cities, and a lot of cities
have a much smaller population. On the internet one observes that very few sites contain
millions of links, but many sites have just one or two links, or that millions of users visit
a few select sites, giving little attention to millions of other sites (see [2]).
Researchers from linguistics, ecology, demography, economics, genetics or, more recently,
social networking use the Zipf distribution to model data usually presented as rank data
or frequencies of frequencies data.

The objective of the paper is to define a two-parameter generalization of the Zipf distri- 
bution that is much more flexible in modelling the probability for the first positive integer 
values while at the same time allowing for both concave and convex representations of 
the first probabilities in the log-log scale. This is done by applying the transformation 
defined by Marshall and Olkin in 1997 [9]. In Marshall and Olkin's original paper, the
transformation is used to generalize the exponential and the Weibull distributions. Later,
the transformation has been used to generalize the Lomax, the Pareto or the Log-normal
distributions. Several papers published in the last few years apply these generalizations
in reliability, in time series and in censored data; see for instance [1], [5], [6] or [7]. In
[11] the transformation is presented as a skewing mechanism, and several classes of
unimodal and symmetric distributions are extended in that manner.

The paper is organized as follows. Section 2 is devoted to the preliminaries. In Section
3 the Marshall-Olkin generalized Zipf distribution is defined and its main properties
are presented. In Section 4, three large data sets from very different research areas are
analyzed, showing the usefulness of the model presented.



2 Preliminaries



2.1 The Zipfian (Zipf) distribution 

A random variable (r.v.) X is said to follow a Zipf distribution with scale parameter
$\alpha > 1$ if, and only if, its probability mass function (pmf) is equal to:

$$P(X = x) = \frac{x^{-\alpha}}{\zeta(\alpha)}, \quad \text{for } x = 1, 2, 3, \dots, \qquad (1)$$

where $\zeta(\alpha) = \sum_{k=1}^{+\infty} k^{-\alpha}$ is the Riemann zeta function. The Zipf distribution is
the particular case of the discrete PL distribution whose support is the strictly positive
integers, and it can also be viewed as the discretization of the Pareto (Type
I) distribution. The Zipf distribution is often suitable to fit data that correspond to
frequencies of frequencies or to ranked data. These types of data show a widespread
pattern in their measurements, with a very large probability at one and a very small
probability at some very large values. Moreover, from (1) one obtains that the Zipf
distribution will be appropriate when the data show a linear pattern in a log-log scale,
because

$$\log(P(X = x)) = -\alpha \log(x) - \log(\zeta(\alpha)).$$
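The log-log linearity can be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper `hurwitz_zeta` (a hypothetical name) approximates the zeta sums by direct summation plus an Euler-Maclaurin tail correction, and the value $\alpha = 1.8$ is chosen only for illustration.

```python
import math

def hurwitz_zeta(s, a, n_terms=64):
    """Approximate sum_{k>=0} (a + k)^(-s) for s > 1 and a > 0.

    The first n_terms are summed directly; the remaining tail is replaced
    by the leading terms of its Euler-Maclaurin expansion.
    """
    head = sum((a + k) ** (-s) for k in range(n_terms))
    b = a + n_terms
    tail = b ** (1.0 - s) / (s - 1.0) + 0.5 * b ** (-s) + s * b ** (-s - 1.0) / 12.0
    return head + tail

def zipf_pmf(x, alpha):
    """P(X = x) = x^(-alpha) / zeta(alpha), for x = 1, 2, 3, ..."""
    return x ** (-alpha) / hurwitz_zeta(alpha, 1)

alpha = 1.8  # illustrative value only
xs = [1, 10, 100, 1000]
log_p = [math.log(zipf_pmf(x, alpha)) for x in xs]

# Slope of log P(X = x) against log x between consecutive points.
slopes = [
    (log_p[i + 1] - log_p[i]) / (math.log(xs[i + 1]) - math.log(xs[i]))
    for i in range(len(xs) - 1)
]
print(slopes)  # every slope equals -alpha up to rounding
```

Every slope coincides with $-\alpha$, which is the straight-line behaviour described above.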



The continuous PL distribution is the only continuous distribution such that the shape
of the distribution curve does not depend on the scale on which the measures are taken
[10]. For this reason, it and its discrete version are also known as scale-free distributions.



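Given that

$$P(X \le x) = \frac{1}{\zeta(\alpha)} \sum_{k=1}^{x} k^{-\alpha},$$

the survival or reliability function (SF) of the Zipf distribution is equal to:

$$\bar{F}(x) = P(X > x) = \frac{\zeta(\alpha, x+1)}{\zeta(\alpha)}, \qquad (2)$$

where $\zeta(\alpha, x)$ is the Hurwitz zeta function, defined as:

$$\zeta(\alpha, x) = \sum_{k=x}^{+\infty} k^{-\alpha}. \qquad (3)$$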

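The mean of a Zipf distribution is finite for $\alpha > 2$, and the variance is finite only when
$\alpha > 3$. Assuming that $\alpha > 3$,

$$E(X) = \frac{\zeta(\alpha-1)}{\zeta(\alpha)}, \quad \text{and} \quad Var(X) = \frac{\zeta(\alpha-2)\,\zeta(\alpha) - (\zeta(\alpha-1))^2}{(\zeta(\alpha))^2}.$$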



Often in practice the Zipf distribution fits the probabilities for large values of X well,
but not the probabilities for the smaller ones. When the observed values are plotted in the
log-log scale, one usually observes a concave or a convex shape for the smallest values, and
the linearity holds only for values larger than a given positive value. In order to avoid this
problem and increase the goodness of fit of the model, in this paper the Zipf distribution
is generalized by means of the Marshall-Olkin transformation.



2.2 The Marshall-Olkin transformation 

In 1997, Marshall and Olkin defined a method of generalizing a given probability distribution
by increasing the number of parameters by one. Assume that X is a r.v. with a given
probability distribution with survival function $\bar{F}(x)$, for $-\infty < x < +\infty$. The
Marshall-Olkin extension of the initial family is defined to be the family of distributions with
survival function equal to:

$$\bar{G}(x) = \frac{\beta \bar{F}(x)}{1 - \bar{\beta} \bar{F}(x)}, \quad -\infty < x < +\infty, \ \beta > 0, \ \text{and} \ \bar{\beta} = 1 - \beta. \qquad (4)$$

The new family contains the initial family as a particular case, obtained when $\beta = 1$.
The proposed transformation has a stability property, in the sense that the result of
applying the transformation twice is also in the extended model.
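The stability property can be verified directly. As a sketch, let $\bar{G}_1$ denote the result of applying (4) with parameter $\beta_1$, and apply (4) again with parameter $\beta_2$ (the subscripted symbols are introduced here only for illustration):

```latex
\bar{G}_1(x) = \frac{\beta_1 \bar{F}(x)}{1 - (1-\beta_1)\bar{F}(x)},
\qquad
\bar{G}_2(x) = \frac{\beta_2 \bar{G}_1(x)}{1 - (1-\beta_2)\bar{G}_1(x)}
= \frac{\beta_1\beta_2 \bar{F}(x)}{1 - (1-\beta_1)\bar{F}(x) - (1-\beta_2)\beta_1\bar{F}(x)}
= \frac{\beta_1\beta_2 \bar{F}(x)}{1 - (1-\beta_1\beta_2)\bar{F}(x)}.
```

The twice-extended survival function is therefore again of the form (4), with parameter $\beta_1\beta_2$.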



3 The Marshall-Olkin Extended Zipf Distribution



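The Marshall-Olkin Extended Zipf model (MOEZipf) is defined to be the set of probability
distributions with SF:

$$\bar{G}(x; \alpha, \beta) = \frac{\beta \bar{F}(x)}{1 - \bar{\beta} \bar{F}(x)} = \frac{\beta\, \zeta(\alpha, x+1)}{\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x+1)}, \quad \beta > 0 \ \text{and} \ \alpha > 1, \qquad (5)$$

where $\bar{F}(x)$ is the SF of the Zipf($\alpha$) distribution in (2). If Y is a r.v. with a MOEZipf($\alpha$, $\beta$)
distribution, then its probability mass function (pmf) is equal to:

$$P(Y = x) = \bar{G}(x-1; \alpha, \beta) - \bar{G}(x; \alpha, \beta) = \frac{x^{-\alpha} \beta\, \zeta(\alpha)}{[\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x)]\,[\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x+1)]}, \quad x = 1, 2, 3, \dots \qquad (6)$$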
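The pmf (6) and the SF (5) translate almost literally into code. The sketch below is illustrative only: `hurwitz_zeta`, `moezipf_sf` and `moezipf_pmf` are hypothetical helper names, the zeta sums are approximated numerically, and the parameter values are those quoted for Figure 1.

```python
import math

def hurwitz_zeta(s, a, n_terms=64):
    """Approximate sum_{k>=0} (a + k)^(-s), s > 1, a > 0 (Euler-Maclaurin tail)."""
    head = sum((a + k) ** (-s) for k in range(n_terms))
    b = a + n_terms
    tail = b ** (1.0 - s) / (s - 1.0) + 0.5 * b ** (-s) + s * b ** (-s - 1.0) / 12.0
    return head + tail

def moezipf_sf(x, alpha, beta):
    """Survival function (5): P(Y > x)."""
    z = hurwitz_zeta(alpha, 1)
    t = hurwitz_zeta(alpha, x + 1)
    return beta * t / (z - (1.0 - beta) * t)

def moezipf_pmf(x, alpha, beta):
    """Probability mass function (6): P(Y = x), x = 1, 2, 3, ..."""
    z = hurwitz_zeta(alpha, 1)
    bbar = 1.0 - beta
    d1 = z - bbar * hurwitz_zeta(alpha, x)
    d2 = z - bbar * hurwitz_zeta(alpha, x + 1)
    return x ** (-alpha) * beta * z / (d1 * d2)

alpha = 1.8
for beta in (0.2, 1.0, 5.0):
    # Mass on 1..200 plus the survival beyond 200 should add up to one.
    total = sum(moezipf_pmf(x, alpha, beta) for x in range(1, 201))
    total += moezipf_sf(200, alpha, beta)
    print(beta, round(moezipf_pmf(1, alpha, beta), 4), round(total, 10))
```

With $\beta = 1$ the function `moezipf_pmf` reduces to the Zipf pmf (1), since both bracketed factors in the denominator become $\zeta(\alpha)$.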



In Figure 1 one can see the pmf of the MOEZipf($\alpha$, $\beta$) distribution for $\alpha = 1.8$ and
different values of $\beta$. It can be appreciated that the probability at one increases as $\beta$
tends to zero and decreases as $\beta$ tends to infinity. This result is proved next, together
with some other results that come from comparing the probabilities of the Zipf and the
MOEZipf distributions.

Proposition 3.1. The probability at one of a r.v. Y with a MOEZipf distribution is a
decreasing function of $\beta$ verifying that:

a) P(Y = 1) tends to 1 when $\beta$ tends to 0.

b) P(Y = 1) tends to 0 when $\beta$ tends to $+\infty$.

Proof: Taking into account that, by (3), $\zeta(\alpha, 1) = \zeta(\alpha)$ and $\zeta(\alpha, 2) = \zeta(\alpha) - 1$, and
setting x = 1 in (6), one has that:

$$P(Y = 1) = \frac{1}{1 + \beta\, \zeta(\alpha, 2)},$$

which is a decreasing function of $\beta$ that tends to zero when $\beta$ tends to infinity and to
one when $\beta$ tends to zero. □
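The intermediate algebra behind this expression can be written out as follows; it is a worked version of the substitution x = 1 in the pmf, using only $\zeta(\alpha, 1) = \zeta(\alpha)$, $\zeta(\alpha, 2) = \zeta(\alpha) - 1$ and $\bar{\beta} = 1 - \beta$:

```latex
P(Y = 1)
= \frac{\beta\,\zeta(\alpha)}{[\zeta(\alpha) - \bar{\beta}\,\zeta(\alpha,1)]\,[\zeta(\alpha) - \bar{\beta}\,\zeta(\alpha,2)]}
= \frac{\beta\,\zeta(\alpha)}{\beta\,\zeta(\alpha)\,[\zeta(\alpha) - (1-\beta)(\zeta(\alpha) - 1)]}
= \frac{1}{1 + \beta\,(\zeta(\alpha) - 1)}
= \frac{1}{1 + \beta\,\zeta(\alpha,2)}.
```

As an illustration with a value not used in the paper, for $\alpha = 2$ one has $\zeta(2, 2) = \pi^2/6 - 1 \approx 0.645$, so that P(Y = 1) is approximately 0.886 for $\beta = 0.2$ and approximately 0.237 for $\beta = 5$.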

Proposition 3.2. For large values of x, parameter $\beta$ may be interpreted as the ratio
between the probabilities of a r.v. Y with a MOEZipf($\alpha$, $\beta$) distribution and the probabilities
of a r.v. X with a Zipf($\alpha$) distribution at x.

Proof: Taking into account that when x tends to infinity $\zeta(\alpha, x)$ tends to zero, one has
that:

$$\lim_{x \to +\infty} \frac{P(Y = x)}{P(X = x)} = \lim_{x \to +\infty} \frac{\beta\, (\zeta(\alpha))^2}{[\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x)]\,[\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x+1)]} = \beta.$$

□



Figure 1: Pmf's for the MOEZipf($\alpha$, $\beta$) model for $\alpha = 1.8$ and $\beta$ = 0.2, 0.8, 1, 1.2, 2 and 5. For
$\beta = 1$, one obtains the Zipf(1.8) distribution.



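Proposition 3.3. Let Y be a r.v. with a MOEZipf($\alpha$, $\beta$) distribution and X be a r.v.
with a Zipf($\alpha$) distribution. For any $x \ge 1$ one has that: if $\beta = 1$, $P(Y = x) = P(X = x)$;
if $\beta > 1$, $\beta P(Y = x) > P(X = x)$; and if $\beta < 1$, $P(Y = x) > \beta P(X = x)$.

Proof:

If $\beta = 1$, the probabilities in (6) correspond to those of the Zipf($\alpha$) distribution.

If $\beta > 1$, then $\bar{\beta} < 0$ and, given that for all $x \ge 1$ $\zeta(\alpha, x) \le \zeta(\alpha)$, one has that
$\bar{\beta}\, \zeta(\alpha, x) \ge \bar{\beta}\, \zeta(\alpha)$. Thus, since $\zeta(\alpha, x+1) < \zeta(\alpha)$ strictly,

$$0 < \zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x) \le \beta\, \zeta(\alpha) \ \Rightarrow \ 0 < [\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x)]\,[\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x+1)] < (\beta\, \zeta(\alpha))^2$$

$$\Rightarrow \ \frac{x^{-\alpha} \beta\, \zeta(\alpha)}{[\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x)]\,[\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x+1)]} > \frac{x^{-\alpha}}{\zeta(\alpha)} \cdot \frac{1}{\beta}$$

$$\Rightarrow \ P(Y = x) > P(X = x)\, \frac{1}{\beta}.$$

If $\beta < 1$, then $\bar{\beta} > 0$ and $\bar{\beta}\, \zeta(\alpha, x) \le \bar{\beta}\, \zeta(\alpha)$. Thus, for any $x \ge 1$,
$0 < \zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x) < \zeta(\alpha)$, which gives that:

$$\frac{x^{-\alpha} \beta}{\zeta(\alpha)} < \frac{x^{-\alpha} \beta\, \zeta(\alpha)}{[\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x)]\,[\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x+1)]} \ \Rightarrow \ \beta P(X = x) < P(Y = x).$$

□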

The MOEZipf distribution is only linear in the log-log scale for large values of x, as is
proved in the following result:

Proposition 3.4. Let Y be a r.v. with a MOEZipf($\alpha$, $\beta$) distribution. For large values
of x, log(P(Y = x)) is a linear function of log(x).

Proof: The proof is a straightforward consequence of the fact that (3) tends to zero when
x tends to infinity. Hence, for large values of x, the denominator of (6) is approximately
equal to $(\zeta(\alpha))^2$, and thus one has that:

$$\log(P(Y = x)) \approx -\alpha \log(x) + \log(\beta) - \log(\zeta(\alpha)). \qquad (7)$$

□

REMARK: The result is also true in a larger support of the distribution if $\alpha$ is large.
This is because (3) is very small for large values of $\alpha$ if $x \ge 2$.
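The quality of the linear approximation, and the effect of $\alpha$ mentioned in the remark, can be illustrated numerically. The sketch below is not taken from the paper: `hurwitz_zeta` and `approx_error` are hypothetical helper names, and $\beta = 2$ with $\alpha \in \{1.8, 3\}$ are arbitrary illustrative values. The quantity computed is the difference between log P(Y = x) and the linear expression $-\alpha\log(x) + \log(\beta) - \log(\zeta(\alpha))$.

```python
import math

def hurwitz_zeta(s, a, n_terms=64):
    """Approximate sum_{k>=0} (a + k)^(-s), s > 1, a > 0 (Euler-Maclaurin tail)."""
    head = sum((a + k) ** (-s) for k in range(n_terms))
    b = a + n_terms
    tail = b ** (1.0 - s) / (s - 1.0) + 0.5 * b ** (-s) + s * b ** (-s - 1.0) / 12.0
    return head + tail

def approx_error(x, alpha, beta):
    """log P(Y = x) minus the linear approximation in log x.

    Algebraically this is log(zeta(alpha)^2 / (d1 * d2)), where d1 and d2 are
    the two bracketed factors in the denominator of the MOEZipf pmf.
    """
    z = hurwitz_zeta(alpha, 1)
    bbar = 1.0 - beta
    d1 = z - bbar * hurwitz_zeta(alpha, x)
    d2 = z - bbar * hurwitz_zeta(alpha, x + 1)
    return math.log(z * z / (d1 * d2))

xs = (1, 10, 100, 1000, 10000)
for alpha in (1.8, 3.0):
    print(alpha, [round(approx_error(x, alpha, 2.0), 4) for x in xs])
```

The error shrinks towards zero as x grows, and at a fixed x it is smaller for the larger value of $\alpha$.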



Figure 2 shows the behaviour of the MOEZipf distribution in the log-log scale, for
different parameter values, together with the straight line obtained by changing $\approx$ by = in
(7).

The next proposition compares the ratio of two consecutive probabilities of a MOEZipf and
a Zipf distribution with the same $\alpha$ value. As can be appreciated in Figure 3, the
ratio only shows important differences for small values of x. Moreover, for values of
$\beta$ smaller (larger) than one, the ratios corresponding to the MOEZipf distribution are
smaller (larger) than the ratios associated with the Zipf distribution.

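Proposition 3.5. If Y is a r.v. with a MOEZipf($\alpha$, $\beta$) distribution and X is a r.v. with
a Zipf($\alpha$) distribution, one has that:

1) If $\beta > (<)\ 1$, then

$$\frac{P(Y = x+1)}{P(Y = x)} > (<)\ \frac{P(X = x+1)}{P(X = x)}.$$

2) $\forall \beta > 0$,

$$\lim_{x \to +\infty} \frac{P(Y = x+1)}{P(Y = x)} = \lim_{x \to +\infty} \frac{P(X = x+1)}{P(X = x)}.$$

Proof: By (6) one has that

$$\frac{P(Y = x+1)}{P(Y = x)} = \left(\frac{x}{x+1}\right)^{\alpha} \frac{\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x)}{\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x+2)} = \frac{P(X = x+1)}{P(X = x)} \cdot \frac{\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x)}{\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, x+2)}. \qquad (8)$$

Given that for any $\alpha > 1$ and any $x \ge 1$, $\zeta(\alpha, x+2) < \zeta(\alpha, x)$, it is possible to state
that if $0 < \beta < 1$ ($\beta > 1$),

$$\frac{\zeta(\alpha) - (1-\beta)\, \zeta(\alpha, x)}{\zeta(\alpha) - (1-\beta)\, \zeta(\alpha, x+2)} < (>)\ 1,$$

which proves point 1). Given that (3) tends to zero when x tends to infinity, point 2) is
a straightforward consequence of (8). □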
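Both points of the proposition can be illustrated with a few lines of code. This is a hedged sketch rather than the authors' implementation: `hurwitz_zeta` and `consecutive_ratio` are hypothetical names, and $\alpha = 2.5$ with $\beta \in \{0.5, 1, 2\}$ are illustrative values.

```python
def hurwitz_zeta(s, a, n_terms=64):
    """Approximate sum_{k>=0} (a + k)^(-s), s > 1, a > 0 (Euler-Maclaurin tail)."""
    head = sum((a + k) ** (-s) for k in range(n_terms))
    b = a + n_terms
    tail = b ** (1.0 - s) / (s - 1.0) + 0.5 * b ** (-s) + s * b ** (-s - 1.0) / 12.0
    return head + tail

def consecutive_ratio(x, alpha, beta):
    """P(Y = x + 1) / P(Y = x) for the MOEZipf model; beta = 1 gives the Zipf ratio."""
    z = hurwitz_zeta(alpha, 1)
    bbar = 1.0 - beta
    factor = (z - bbar * hurwitz_zeta(alpha, x)) / (z - bbar * hurwitz_zeta(alpha, x + 2))
    return (x / (x + 1)) ** alpha * factor

alpha = 2.5
for x in (1, 2, 5, 20, 10000):
    row = [consecutive_ratio(x, alpha, beta) for beta in (0.5, 1.0, 2.0)]
    print(x, [round(r, 6) for r in row])
```

For each x the three ratios are ordered as the proposition states, and for large x they become practically indistinguishable.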



Figure 2: Probabilities of the MOEZipf($\alpha$, $\beta$) distribution in the log-log scale. Each row
corresponds to a different value of $\alpha$ and each column to a different value of $\beta$. The parameter
values considered are: $\alpha$ = 1.1, 1.5 and 3 and $\beta$ = 0.5, 1, 2 and 5.

Figure 3: Ratio of two consecutive probabilities for the MOEZipf($\alpha$, $\beta$) and Zipf($\alpha$) distributions
with $\alpha$ equal to 1.1 and 2.5 and $\beta$ equal to 0.5, 1, 1.5 and 2.5.

3.1 Parameter estimation

In this subsection two ways of estimating the parameters of the MOEZipf($\alpha$, $\beta$)
distribution are considered. The maximum likelihood estimator (m.l.e.), denoted by ($\hat{\alpha}$, $\hat{\beta}$),
is obtained by maximizing the corresponding log-likelihood function, which in this case
takes the form:

$$l(\alpha, \beta; y_i) = n \log(\beta) + n \log(\zeta(\alpha)) - \alpha \sum_{i=1}^{n} \log(y_i) - \sum_{i=1}^{n} \log(\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, y_i)) - \sum_{i=1}^{n} \log(\zeta(\alpha) - \bar{\beta}\, \zeta(\alpha, y_i + 1)).$$
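A minimal sketch of this log-likelihood, evaluated on a frequency table, is given below. It is only an illustration under stated assumptions: the toy table `counts` is hypothetical and is not one of the data sets of the paper, the helper names are invented here, and the coarse grid search merely stands in for a proper numerical optimizer.

```python
import math

def hurwitz_zeta(s, a, n_terms=64):
    """Approximate sum_{k>=0} (a + k)^(-s), s > 1, a > 0 (Euler-Maclaurin tail)."""
    head = sum((a + k) ** (-s) for k in range(n_terms))
    b = a + n_terms
    tail = b ** (1.0 - s) / (s - 1.0) + 0.5 * b ** (-s) + s * b ** (-s - 1.0) / 12.0
    return head + tail

def moezipf_loglik(alpha, beta, counts):
    """MOEZipf log-likelihood for a table {observed value: number of observations}."""
    z = hurwitz_zeta(alpha, 1)
    bbar = 1.0 - beta
    n = sum(counts.values())
    ll = n * math.log(beta) + n * math.log(z)
    for y, c in counts.items():
        ll -= c * alpha * math.log(y)
        ll -= c * math.log(z - bbar * hurwitz_zeta(alpha, y))
        ll -= c * math.log(z - bbar * hurwitz_zeta(alpha, y + 1))
    return ll

# Hypothetical toy frequency table (n = 100), for illustration only.
counts = {1: 60, 2: 20, 3: 8, 4: 5, 7: 4, 15: 2, 40: 1}

alphas = [1.2 + 0.1 * i for i in range(19)]  # 1.2, 1.3, ..., 3.0
betas = [k / 5 for k in range(1, 16)]        # 0.2, 0.4, ..., 3.0 (includes 1.0)

# Coarse grid search; beta fixed at 1 gives the Zipf sub-model.
best = max((moezipf_loglik(a, b, counts), a, b) for a in alphas for b in betas)
best_zipf = max((moezipf_loglik(a, 1.0, counts), a) for a in alphas)
print("MOEZipf grid maximum:", best)
print("Zipf grid maximum:   ", best_zipf)
```

Because the grid contains $\beta = 1$, the MOEZipf maximum can never fall below the Zipf one; in practice a gradient-based or simplex optimizer started from the grid maximum would refine the estimates.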



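The second method of estimation considered consists of solving numerically the system
of equations that comes from equating the observed and expected probabilities at one,
and the sample mean to the expected value of the distribution. If one denotes by $f_1$ the
observed relative frequency at one, and by $\bar{y}$ the sample mean, this leads to solving

$$\begin{cases} (\zeta(\alpha, 2)\, \beta + 1)^{-1} = f_1 \\ E(Y) = \bar{y}, \end{cases} \qquad (9)$$

which can be done by first solving the following equation in $\alpha$:

$$\bar{y} = \frac{(1 - f_1)\, \zeta(\alpha)}{f_1\, \zeta(\alpha, 2)} \sum_{x=1}^{+\infty} \frac{x^{-\alpha+1}}{\left[\zeta(\alpha) - \frac{f_1 \zeta(\alpha) - 1}{f_1 \zeta(\alpha, 2)}\, \zeta(\alpha, x)\right] \left[\zeta(\alpha) - \frac{f_1 \zeta(\alpha) - 1}{f_1 \zeta(\alpha, 2)}\, \zeta(\alpha, x+1)\right]},$$

and then estimating $\beta$ by substituting the estimate of $\alpha$ in the first equation of (9).
The solution obtained by this method is denoted by ($\tilde{\alpha}$, $\tilde{\beta}$).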
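The reduction of the system to a single equation in $\alpha$ follows from solving the first equation for $\beta$. As a sketch, writing $f_1$ for the observed relative frequency at one and using $\zeta(\alpha, 2) = \zeta(\alpha) - 1$:

```latex
(\zeta(\alpha,2)\,\beta + 1)^{-1} = f_1
\;\Longrightarrow\;
\beta = \frac{1 - f_1}{f_1\,\zeta(\alpha,2)},
\qquad
\bar{\beta} = 1 - \beta = \frac{f_1\,\zeta(\alpha) - 1}{f_1\,\zeta(\alpha,2)},

E(Y) = \sum_{x=1}^{+\infty} x\,P(Y = x)
= \beta\,\zeta(\alpha) \sum_{x=1}^{+\infty}
\frac{x^{-\alpha+1}}{[\zeta(\alpha) - \bar{\beta}\,\zeta(\alpha,x)]\,[\zeta(\alpha) - \bar{\beta}\,\zeta(\alpha,x+1)]}.
```

Substituting these expressions for $\beta$ and $\bar{\beta}$ into $E(Y) = \bar{y}$ leaves $\alpha$ as the only unknown. Note that the series only converges when the mean of the distribution is finite.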



4 Data analysis 



In this section three sets of data are fitted by means of the Zipf and the MOEZipf models, 
and the results obtained are compared. All the data sets have a very large sample size 
and correspond to real data obtained from different areas. 



4.1 Example from Linguistics 

The data of this example correspond to the frequency of occurrence of words in the
novel Moby Dick by Herman Melville and can be found in:

http://tuvalu.santafe.edu/~aaronc/powerlaws/data.htm

This set of data was first analyzed in [12], which is the reference where the Zipf distribution
was defined. More recently, [4] and [3] have also considered this set of data. The former
uses the data set to compare real with random texts, and the latter fits the
data by means of a general PL distribution. The set contains the frequencies of a total
of 18855 words, and nearly 75% of the observations correspond to the first three positive
integer values. The three most frequent words appear 14086, 6414 and 6260 times. The
observations larger than or equal to 53 have been grouped, in order to be able to compare
the two models by means of the $\chi^2$ goodness-of-fit statistic.

The m.l.e. of $\alpha$ obtained under the two different models do not differ
considerably. Nevertheless, the m.l.e. of parameter $\beta$ under the MOEZipf model is about 50%
larger than the value $\beta = 1$ that corresponds to the Zipf model. The reduction obtained in
the Pearson $\chi^2$ statistic using the proposed model instead of the Zipf model is equal to 79.64%.
The AIC as well as the log-likelihood show that the generalized model with the parameters
estimated by maximum likelihood gives the best fit. In Figure 4 one can see the data together
with the two fitted models in a log-log scale.



Distrib.                Param.     Estimat.   log-like.    $\chi^2$   p-val.   AIC
Zipf                    $\alpha$   1.775      -40196.00    272.38              80394.00
MOEZipf (2nd method)    $\alpha$   1.908      -40086.28    62.96      0.097    80176.56
                        $\beta$    1.429
MOEZipf (m.l.e)         $\alpha$   1.944      -40082.42    55.45      0.293    80168.83
                        $\beta$    1.523

Table 1: Results of fitting the variable: Frequency of occurrence of words, in Moby Dick.
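The AIC values and the quoted reduction in the $\chi^2$ statistic can be recomputed from the reported log-likelihoods. The sketch below assumes the standard definition AIC = 2k - 2 log L, with k the number of estimated parameters; this form is an assumption, although it agrees with the tabulated values up to rounding.

```python
def aic(loglik, n_params):
    """Akaike information criterion, AIC = 2k - 2 log L."""
    return 2 * n_params - 2 * loglik

# Values copied from Table 1 (Moby Dick word frequencies).
zipf = {"loglik": -40196.00, "chi2": 272.38, "n_params": 1}
moezipf_mle = {"loglik": -40082.42, "chi2": 55.45, "n_params": 2}

aic_zipf = aic(zipf["loglik"], zipf["n_params"])
aic_moezipf = aic(moezipf_mle["loglik"], moezipf_mle["n_params"])

# Relative reduction of the chi-square statistic, in percent.
reduction = 100 * (zipf["chi2"] - moezipf_mle["chi2"]) / zipf["chi2"]

print(aic_zipf, aic_moezipf, round(reduction, 2))
```

The extra parameter costs two AIC units, which is far outweighed by the gain in log-likelihood for this data set.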
Figure 4: Observed and expected data in the log-log scale, for the word frequency data.
Plot title: Moby Dick word frequencies. Legend: Zipf($\alpha$ = 1.775), MOEZipf($\alpha$ = 1.94, $\beta$ = 1.52).



4.2 Example from electronic mail 

Given a database containing different electronic mail addresses, one can count how
many connections one address has had in a given period of time. The table of
frequencies of such a r.v. tends to have a large probability at one (most of the addresses
only have one contact), and a very small probability at some large values (just a few
addresses have lots of contacts). The data set analyzed in this example corresponds to the
number of connections of a total of 225409 electronic addresses, and may be found in
http://snap.stanford.edu/data/email-EuAll.html. They were collected between October
2003 and May 2005. In [8] this data set is analyzed by fitting a PL distribution in the
tail of the distribution.

Here 85% of the observations are equal to one. The observed probabilities of the first five
values decrease very quickly and, after these values, they decrease more slowly. The three
addresses with the largest number of contacts have exactly 930, 871 and 854 contacts.
After grouping the data larger than 65, the $\chi^2$ statistic is reduced by 93.74% when using
the generalized model instead of the original one, both fitted by m.l. The AIC criterion
indicates that the MOEZipf model is the best one.



Figure 5 shows the observed and fitted probabilities in the log-log scale. It can be
appreciated that the convex behaviour of the MOEZipf model leads to a better fit,
not only for the first values of the distribution but also for the values in the tail.
Distrib.                Param.     Estimat.   log-like.     $\chi^2$    p-val.   AIC
Zipf                    $\alpha$   2.968      -156765.21    13714.84             313532.42
MOEZipf (2nd method)    $\alpha$   2.126      -154526.75    968.34               309057.51
                        $\beta$    0.321
MOEZipf (m.l.e.)        $\alpha$   2.284      -154399.82    858.27               308803.64
                        $\beta$    0.390

Table 2: Results of fitting the variable: Number of relations by electronic mail.

Figure 5: Observed and expected data in the log-log scale, for the e-mail example.



4.3 Example from citations 

The last example considered corresponds to the number of times that a given paper is
cited in a given database. This is an important variable because it allows one to calculate
the impact factor of scientific journals. The database analyzed has a total of 32158
papers in the area of High-energy physics, published in arXiv.org between January 1993
and April 2003, and may be found in http://snap.stanford.edu/data/citHepPh.html.
This data set has also been analyzed in [8].

From these data one observes that 26% of the probability corresponds to the first
two values, meaning that more than a quarter of the papers are cited at most twice.
For model fitting, we have grouped the values larger than 119. As in the previous
examples, Table 3 indicates that the MOEZipf model estimated by m.l. provides the
best fit. The inclusion of the $\beta$ parameter implies a reduction of 87.73% in the $\chi^2$
statistic. It is important to note that the estimate of $\beta$ is equal to 13.1, which is a very
large value compared with the value of 1 that corresponds to the Zipf distribution. In Figure 6
it is possible to appreciate that the MOEZipf model shows a concave behaviour that
improves considerably the fit of the first values as well as the one in the tail of the
distribution.



Distrib.                Param.     Estim.     log-like.     $\chi^2$    p-val.   AIC
Zipf                    $\alpha$   1.421      -105839.81    13172.05             211681.61
MOEZipf (2nd met.)      $\alpha$   2.214      -99490.01     1615.25              198984.03
                        $\beta$    11.677
MOEZipf (m.l.e)         $\alpha$   2.161      -99197.93     816.62               198399.87
                        $\beta$    13.058

Table 3: Results of fitting the r.v.: Number of citations of a given paper.



Figure 6: Observed and expected data in the log-log scale, for the citations example.
Plot title: Citations of a paper.



5 Conclusions 



The Marshall-Olkin transformation has proved to be useful for generalizing the Zipf
distribution, both in terms of providing good properties and in terms of improving
the goodness of fit obtained in the data sets analyzed. The extended model can show
concavity or convexity in the first part of the domain, as is shown by means of the
examples. The linear behaviour is always observed in the tail of the distribution. The
extra parameter also allows for ratios between two consecutive probabilities larger or
smaller than the corresponding ratio of a Zipf distribution. In the three data sets
considered, the fits obtained for the first values are considerably better than the ones
corresponding to the Zipf distribution, but they are also better in the tail. The reduction
in the $\chi^2$ goodness-of-fit statistic has always been close to or larger than 80%. The AIC
points to the MOEZipf model as the better model in all the examples considered.

Acknowledgements. The authors want to thank D. Dominguez-Sal and J.LL. Larriba-Pey
for their help in providing the data sets that are analyzed in this work, and
J. Ginebra for his interesting comments and suggestions that helped to considerably
improve the manuscript. The first author also wants to thank the Spanish Ministry
of Science and Innovation for Grants No. TIN2009-14560 and MTM-2010-14887, and the
Generalitat de Catalunya for Grant No. SGR-1187.



References 



[1] T. Alice and K. Jose (2005). Marshall-Olkin semi-Weibull minification processes.
Recent Advances in Statistical Theory and Applications, I, pp. 6-17.

[2] L.A. Adamic and B.A. Huberman (2002). Zipf's law in internet. Glottometrics, vol.
3, pp. 143-150.

[3] A. Clauset, C.R. Shalizi, and M.E. Newman (2009). Power-law distributions in
empirical data. SIAM Review, vol. 51, pp. 661-703.

[4] R. Ferrer and R.V. Sole (2002). Zipf's law and random texts. Advances in Complex
Systems, vol. 5, pp. 1-6.

[5] M.E. Ghitany, E.K. Al-Hussaini, and R.A. Al-Jarallah (2005). Marshall-Olkin
extended Weibull distribution and its application to censored data. Journal of Applied
Statistics, vol. 32, pp. 1025-1034.

[6] M.E. Ghitany, F.A. Al-Awadhi, and L.A. Alkhalfan (2007). Marshall-Olkin
Extended Lomax distribution and its application to censored data. Communications
in Statistics - Theory and Methods, vol. 36, pp. 1855-1866.

[7] W. Gui (2013). A Marshall-Olkin Power Log-normal distribution and its application
to survival data. International Journal of Statistics and Probability, vol. 2 (No. 1),
pp. 63-71.

[8] J. Leskovec (2008). Dynamics of Large Networks. PhD thesis, School of Computer
Science, Carnegie Mellon University.

[9] A.W. Marshall and I. Olkin (1997). A new method for adding a parameter to a
family of distributions with application to the exponential and Weibull families.
Biometrika, vol. 84, pp. 641-652.

[10] M.E.J. Newman (2005). Power laws, Pareto distributions and Zipf's law.
Contemporary Physics, vol. 46, pp. 323-351.

[11] F.J. Rubio and M.F.J. Steel (2012). On the Marshall-Olkin transformation as a
skewing mechanism. Computational Statistics and Data Analysis, vol. 56 (No. 7), pp.
2251-2257.

[12] G.K. Zipf (1949). Human behaviour and the principle of least effort. Addison-Wesley
Press.






